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Abstract. Sequence assembly from short reads is an important prob- 
lem in biology. It is known that solving the sequence assembly problem 
exactly on a bi-directed de Bruijn graph or a string graph is intractable. 
However finding a Shortest Double stranded DNA string (SDDNA) con- 
taining all the fc-long words in the reads seems to be a good heuristic 
to get close to the original genome. This problem is equivalent to find- 
ing a cyclic Chinese Postman (CP) walk on the underlying un-weighted 
bi-directed de Bruijn graph built from the reads. The Chinese Postman 
walk Problem (CPP) is solved by reducing it to a general bi-directed 
flow on this graph which runs in 0(|i5p log^(|\/|)) time. 

In this paper we show that the cyclic CPP on bi-directed graphs can be 
solved without reducing it to bi-directed flow. We present a 0(^(11^1 + 
log(|\/|) + (dmaxp)^) time algorithm to solve the cyclic CPP on 
a weighted bi-directed de Bruijn graph, where p = max{|{t;|cii„(ii) — 
doutiv) > 0}\,\{v\din{v) - dout{v) < 0}|} and dmax = max{|di„(«) - 
rfout(w)}. Our algorithm performs asymptotically better than the bi- 
directed flow algorithm when the number of imbalanced nodes p is much 
less than the nodes in the bi-directed graph. From our experimental re- 
sults on various datasets, we have noticed that the value of p/|y| lies 
between 0.08% and 0.13% with 95% probability. 

Many practical bi-directed de Bruijn graphs do not have cyclic CP walks. 
In such cases it is not clear how the bi-directed flow can be useful in 
identifying contigs. Our algorithm can handle such situations and identify 
maximal bi-directed sub-graphs that have CP walks. A ©(pdl^l + \E\)) 
time heuristic algorithm based on these ideas has been implemented for 
the SDDNA problem. This algorithm was tested on short reads from a 
plant genome and achieves an approximation ratio of at most 1.0134. We 
also present a ©((IV^I + \E\) log(V)) time algorithm for the single source 
shortest path problem on bi-directed de Bruijn graphs, which may be of 
independent interest. 



1 Introduction 



Sequencing the human genome was one of the major scientific breakthroughs 
in the last seven years. Analysis of the sequenced genome can give us vital 
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information about the expression of genes, which in turn can help scientists 
to develop drugs for diseases. Thus sequencing the genome of an organism is 
of fundamental importance in both medicine and biology. Unfortunately the 
technology used in major human genome sequencing projects - Human Genome 
Project (HGP) [1] and Celera [2], was too expensive to be adopted in a large scale. 
This led to the research on next-generation sequencing methods. Pyro- sequencing 
technologies such as SOLID, 454 and Solexa generate a large number of short 
reads which are very accurate. Directed de Bruijn graph based sequence assembly 
algorithms such as [3] and [4] seem to handle these short read data efficiently 
compared to the string graph based algorithms (sec e.g., [5]). Unfortunately 
solving the sequence assembly problem exactly on both these graph models seems 
intractable [6] . However heuristics such as finding a shortest string which includes 
all the fc-mers (sub strings of length k) seem to yield results close to the original 
genome. In the case of directed de Bruijn graphs finding an Eulerian tour seems 
to yield good results. If the graph is not Eulerian then a Chinese Postman 
(CP) tour has been suggested in [4]. To account for the double strandedness of 
the DNA molecule we need to simultaneously search for two complimentary CP 
tours. In [6] the directed de Bruijn graphs are replaced with bi-directed de Bruijn 
graphs to find two complimentary CP tours simultaneously. A CP tour on the 
un-weighted bi-directed graph constructed from the reads serves as a solution 
to the Shortest Double Stranded DNA string (SDDNA) problem. The solution 
presented in [6] solves the SDDNA problem by reducing it to a general weighted 
bi-directed flow problem. This algorithm runs in 0{\E\'^ log^iV)) time. 

In this paper wc present algorithms for SDDNA/CPP on bi-directed de 
Bruijn graphs without using a bi-directed flow algorithm. Our algorithms are 
based on identifying shortest bi-directed paths and use of weighted bi-partite 
matching. Our algorithms perform asymptotically better than the bi-directed 
flow algorithm when the imbalanced nodes in the bi-directed graphs are much 
smaller in number than \V\. This restriction seems to be true in practice from 
what we have observed in our experiments. On the other hand it turns out that 
in many practical situations these bi-directed de Bruijn graphs fail to have cyclic 
CP tours. In these cases it is not clear how the bi-directed flow algorithm [6] can 
help us in identifying a set of contigs covering every fc-long word at least once. 
In contrast to this flow algorithm, our algorithm can be useful in obtaining a 
minimal set of contigs when a cyclic CP tour does no exist. We now summa- 
rize our results as follows. Firstly our deterministic algorithm to solve the cyclic 
CPP on a general bi-directed graph takes 6>(j3(|y| -|- jE'l) log(|y|) -|- {dmaxP)^) 
time, where dmax = ma,x{\din{v) - dout{v)\,v e V}, p = max{\V+\,\V~\}, 
V+ = {v\v e V,d„i{v) - dout > 0} and V' = {v\v G V,d^n{v) - dout < 0}. 
Secondly we solve the SDDNA problem on an un-weighted bi-directed de Bruijn 
graph deterministically in 0(p(|F| -|- -|- {dmax p)^) time. As a consequence we 
also present a (9((|T^| + \E\) \og{V)) time single source shortest bi-directed path 
algorithm, which may be of independent interest to some assembly algorithms 
such as Velvet [3] - TourBus heuristic. 
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The organization of the paper is as follows. In Section 2 we provide some 
preliminaries. Section 3 defines the CPP and SDDNA problems. In Section 4 
we introduce our algorithm for single source shortest bi-directed paths, which is 
used as a component in our main algorithm. The main algorithm is introduced 
in Section 7 along with algorithms for several sub-problems. Section 9 briefly 
explains how we can handle situations when the bi-directcd graphs do not have 
cyclic CP tours. A greedy algorithm that runs in ©(j^dV^I + \E\) time is described 
in Section 8. Finally experimental studies are reported in Section 10. 



2 PreliminEiries 

Let s G 17" be a string of length n. Any substring Sj (i.e., s[j, . . . j + k — l],n — 
fc + 1 > j > 1) of length k is called a fc— mer of s. The set of all A;— mer's of a 
given string s is called the A;— spectrum of s and is denoted by S(s, k). Given a 
fc— mer Sj, Sj denotes the reverse compliment of Sj (e.g., if Sj = AAGTA then 
Sj = TACTT). Let < be the partial ordering among the strings of equal length, 
then Sj < Sj indicates that string Sj is lexicographically smaller than Sj. Given 
any fc— mcr s^, let si be the lexicographically smaller string between Si and Si. 
We call Si the canonical /c— mer of s^. More formally, if Sj < Sj then Sj = Sj else 
Si = Si. A fc— molecule of a given fc— mer Sj is a tuple (sj, si) consisting of Sj 
and its reverse compliment si, the first entry in this tuple is called the positive 
strand and the second entry is called the negative strand. 

A bi-directed graph is a generalized version of a standard directed graph. In 
a directed graph every edge ( -[> or <]-) has only one arrow head. On the other 
hand, in a bi-directed graph every edge (<l-l>, <1-<1,I>-<1 or l>-|>) has two arrow 
heads attached to it. Formally, let V be the set of vertices of a bi-directed graph, 
E = {{vi.Vj, oi,02)\vi,Vj G V Aoi,02 G {<, o}} is the set of bi-dircctcd edges in 
a bi-directed graph G{V, E). A walk w{vi,Vj) between two nodes Vi, Vj £ V oi a 
bi-directed graph G{V,E) is a sequence Vi,ei^, Vi^ , Ci^ ^Vi^ . . . vt^ , e^^^ ^ , Vj , such 
that for every intermediate vertex Ui, , 1 < ? < m, the orientation of the arrow 
heads on either side is opposite. To make this more clear let e^, , ii^, , 6^,^^ be the 
sub-sequence in the walk w{vi,Vj), e,, = , Vi^ , 01,02), ei,_^i = {vi^ , Vi^^^ , 01,02) 
then for the walk to be valid e^, .02 = e^j^^.O!. If Vj = Vi and eij.oi = ei^^j.02 
then the walk is called cyclic. A walk on the bi-directed graph is referred to as 
a bi-directed walk. We define a orientation function O : V'^ {>,<}^ which 
gives the orientation of the bi-dircctcd edge between a pair of vertices if one 
exists . For instance if {vi,Vj, <},>) is a bi-directed edge between Vi and Vj then 
0{vi,Vj) = <-!>. An edge which is adjacent on a vertex with a orientation 
O (<l) is called an incoming (outgoing) edge. The incoming (outgoing) degree 
of a vertex v is denoted by din{v) {dout{v)). A vertex v is called balanced iff 
dini^) — dout{v) = 0. A vertex is called imbalanced iff \din{v) — dout{v)\ > 0. 
The imbalance of a vertex is called positive iff din{v) — dout{v) > 0. Similarly 
a vertex is negative imbalanced iff din{v) — dout{i>) < 0. A bi-directed graph is 
called connected iff every pair of vertices have a bi-directed walk between them. 
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A de Bruijn graph D'^(s) of the order A; on a given string s is defined as follows. 
The vertex sot V of D''{s) is defined as the A;— spectrum of s (i.e.. V = §(s, k)). 
We use the notation suf{vi, l){pre{vi, I)) to denote the suffix(prefix) of length I 
in string Vi. The symbol . denotes concatenation between two strings. Finally the 
set of directed edges E of D^{s) is defined as follows E = {(vi,Vj)\suf(vi,k — 
1) = pre{vj,k — 1) A Vi[l].suf{vi,k — l).Vj[k] e S{s,k + 1)}. We can further 
generalize the definition of a de Bruijn graph B''{S) on a set 5 = {si,S2 ■ ■ ■ ««} 
of strings, V — U"^i§(si, A;) and E = {{vi, Vj)\suf{vi, k—l)= pre{vj,k—l)A3 1 : 
Vi[l].suf{vi,k- l).Vj[k] e S{si,k + 1)}. 

To model the double strandedness of the DNA molecules we should also 
consider the reverse compliments {S = {si, S2 . . . sTi}) while we build the de 
Bruijn graph. To address this a bi-directed de Bruijn graph BD''{S[JS) has been 
suggested in [6] . The set of vertices V of BD'' {S U S) consists of all the possible 
A;— molecules from E''. For every A; + l— mer zGSUS,iix,y are the two A;— mer's 
of z then an edge is introduced between the A;— molecules (yi,Vj) corresponding 
to X and y. The orientations of the arrow heads on the edges is chosen as follows. 
If both X, y are the positive strands in Vi, Vj an edge (vi, Vj, \>, >) is introduced. 
If a; is a positive strand in Vi and y is a negative strand in vj an edge {vi,Vj,[>,<) 
is introduced. Finally if a; is a negative strand in Vi and y is a positive strand in 
Vj an edge {vi,Vj,<,\>) is introduced. 

3 Problem Definitions 

A Chinese Postman walk in a bi-directed graph is a bi-directed walk which visits 
every edge at least once. A cyclic Chinese Postman walk of minimum cost on a 
weighted bi-directed graph is denoted as CPW. The problem of finding a CPW 
is referred to as CPP. The problem of finding a CPW on an un-weighted bi- 
directed de Bruijn graph (of order A;) constructed from a set of reads is called 
the Shortest Double stranded DNA string (SDDNA) problem. In this paper we 
give algorithms for the cyclic CPP and SDDNA problems. 

4 Single Source Shortest Path Algorithm on a Bi-directed 
de Bruijn Graph 

We first present an algorithm for the single source shortest path problem on a 
bi-directed de Bruijn graph. The bi-directed de Bruijn graph in the context of 
sequence assembly has non- negative weights on the edges. This makes it possible 
to extend the classic Dijkstra's single source shortest path algorithm to these 
graphs. In our algorithm we attach two labels for each vertex in the bi-directed 
graph. Given a source vertex s, the algorithm initializes all the labels similar to 
Dijkstra's algorithm. In each stage of the algorithm a label with the smallest cost 
is picked and some of labels corresponding to adjacent nodes are updated. The 
only major difference between Dijkstra's algorithm and our algorithm is the way 
we update the labels. Dijkstra's algorithm updates all the labels/nodes which 
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are adjacent to the smallest label/node currently picked. However our algorithm 
updates only those labels/nodes which are consistent with the bi-directed walk 
property. 

We now give details of our algorithm and prove its correctness. Let G = 
(V, E) be the bi-directed graph of interest. Also let s be the source and t be the 
destination. We are interested in finding a shortest bi-directed walk from s to t. 
We introduce two labels dist~^[u], dist~ [u] for every vertex u & V . The algorithm 
first initializes labels corresponding to the source (i.e. dist+[s] and dist~[s\) to 
zero. Along with this labels of all the nodes adjacent to s are also initialized 
with the corresponding edge weight. The orientation of the edge determines the 
label we use for initialization. For instance, if {s,v) is a bi-directed edge with 
> — < as the orientation, the label dist~[v] is initialized with Ws^v On the other 
hand dist~[v\ is left uninitialized. In contrast, if the orientation of the edge is 
t> — > then dist^v] is initialized to Ws,v and dist~[v\ is left uninitialized. All the 
uninitialized labels contains oo by default. 

In each iteration of the algorithm a label with the minimum cost is picked. 
Since we have two types of labels, the minimum label can come from either disf^ 
or dist~ . In the first case let be the node corresponding to the minimum label 
during the iteration. This intuitively means that we have a path from s to m+ and 
the orientation of the edge adjacent to u+ in this path is either < — [>or[> — >- 
we are going to prove this fact later in the correctness. On the other hand if u+ 
is different from the destination t, then m+ may possibly appear as an internal 
node in the shortest bi-directed walk between s and t. In this case the path 
through should satisfy bi-directed walk constraint. Thus wc should explore 
only those node(s) adjacent to with an edge(s) orientated as > — < or i> — i>. 
The orientation of the edge determines the type of the label we need to update 
- similar to the label initialization. For instance let be an edge adjacent 

on w+ with a orientation of > — <1. In this case we should use label dist^[v] 
to make an update. Similarly if the orientation of the same edge is [> — > then 
dist^[v] is used in the update process. Consistent with the classical terminology 
of the Dijkstra's algorithm, we refer to the minimum cost label picked in each 
iteration as the permanent label. For instance if a label dist~ [v] is picked to be 
the minimum label in an iteration then we call dist[v] as the permanent label of 
node V. Now to prove the correctness of the algorithm. It is sufficient to show 
that the cost on the permanent label of a node in each iteration is the weight of 
the shortest bi-directed path from s to that node. 

Theorem 1. The permanent label of a node u ^ V in each iteration of Algo- 
rithm 1 is the weight of the shortest bi-directed path from s to u. 

Proof. We prove the statement by induction on the number (n) of iterations in 
Algorithm 1. We now prove the base case when n = 1. Since we have initialized 
dist^[s] = dist~[s] = and the values of the remaining both initialized and 
uninitialized nodes are > 0; the first iteration picks s and zero is trivially the 
cost of shortest bi-directed path form s to s. 
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Algorithm 1: Algorithm to find the shortest bi-directed path from s to t 

INPUT : Bi-directed graph G = {V, E) and two vertices s,t€V 
OUTPUT: Cost of the shortest bi-directed path between s and t 



disf^ls] = dist [s\ = 

dist^lv] = dist~ [v\ = oo '^v £ V /\ V ^ s 

while disf^ ^ (j) or dist~ ^ 4> do 
■a^ — mm.u{dist'^} 
u~ = vcmLu{dist~} 

if u'^ = t or u~ = t then 
|_ return mm{dist'^[u^],dist~[u~]} 



if dist'^lu'^] < dist [u ] then 

U+ = {v\{u+,v) e E A{0)iu+,v) 
U- = {v\{u+,v) e E /\ {0){u+,v) 
dist[u'^] = dist'^lu'^] 
dist^ = disf^ — {«^} 

else 

U+ = {v\{u-,v)£EA(0){u-,v) 
U- = {v\{u-,v) G E A {0){u+,v) 
dist[u^] = dist^[u^] 
dist~ = dist~ — {u~} 



<-<)}} 
<->)}} 



> ■ 

> ■ 



<)}} 
>)}} 



foreach u £ dist^ do 
|_ disf^ [u] = min{dist^ [u] , disf^ [«"*"] + w[u^ , u] } 

foreach u G dist~ do 
|_ dist~[w] = mhi{dist~[u\,dist~[u~\ + w[u~ ,v\} 



30 

31 return oo 
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(a) (b) 



Fig. 1. (a) node 4 contains two bi-directed walks from node 1, the green colored path 
is the shortest. (b) the walk starting from node 1 and ending at node 1 is a Chinese 
walk but not a cyclic Chinese walk. 



Assume that the statement is true for n — I . . .k. As per the induction 
hypothesis the permanent labels dist[s],dist[vi2] ■ ■ -distlvif,] correspond to the 
costs of the shortest bi-directed paths between s and s,Wi2 . . .Vi^. 

Now let dist'{vi^^^\ < dist[vi^j^^] be the cost of the shortest bi-directed walk 
from s to Vi^^^ . Also let s,Vj^ ... Vj^. , Vi^^^ be the path corresponding to the cost 
dist'[vi^,j^^]. Note that Vj^ cannot be one of the nodes with a permanent label. If 
not, we would have dist'[vi^j^^\ = dist\vi^j^^] (because we should have updated 
Vk+i when the Vj^, was given a permanent label) which is a contradiction. Now 
let dist'[vj^] be the cost of the shortest path from s to Vj^. Clearly, dist'[vj^] < 
dist\vi^^-^] and this means that none of the nodes Vj^ , Vj^ . . . Vj^, haa a permanent 
label. Since in the iteration n = 1 the algorithm updated the labels adjacent to 
all the nodes this means that either disf^lvj^] or dist^[vj^] should have a cost 
< Wsj2 s-nd dist'[vij^^-^] > u)sj2 - In each iteration from n = 1, . . . , (fc -I- 1) we 
picked the globally minimum label dist[vi^_^-^] < Wsj2 ^ dist'lvi,^^^] which is a 
contradiction. □ 



We now give a simple example to illustrate the algorithm. Consider the bi- 
directed graph in Figure 4(a), with a unit weight on every edge. Let s — 1 
and i = 4 for instance. From Figure 4(a) we see two bi-directed walks - red, 
green. The green path is the shortest path of length 4 units. Now let us run our 
algorithm on this graph. The algorithm starts with initializing from s = 1 and 
from then on Table 3 shows how the dist labels are updated until iteration 7 
where we reach the target node 4 and stop the algorithm. 
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5 Terminal Oriented Shortest Bi-directed Walks 

In the previous section we have seen how to find a shortest bi-directed walk be- 
tween two nodes in a given bi-directed graph. We now define a terminal oriented 
bi-directed walk as follows. Let w{vi,Vj) — Vi, e^j , Vi^ , , (,',;^ . . . Vi^ , ej^^j , Vj be 
any bi-directed walk between two nodes Vi and Vj in a bi-directed graph. Then 
this bi-directed walk w{vi,Vj) is called terminal oriented bi-directed walk iff 
eij.Oi — > and e^^^ .02 = >. For example in Figure 4(a) there arc two bi- 
directed walks between nodes 4 and 1 - marked with green and red. However 
only the green bi-directed walk is terminally oriented. A terminal oriented bi- 
directed walk w is called the shortest terminal oriented bi-directed walk iff there 
is no other terminal oriented bi-directed walk shorter than w. 

5.1 An algorithm for finding a terminal oriented shortest 
bi-directed walk 

It is easy to modify Algorithm 1 to find a terminal oriented shortest path between 
s and t. We only have to modify the initialization step and the step which 
checks if the target node has been reached. During the initialization at line 
2 of Algorithm 1 we make dist~^[s] = and dist~[s] = 00. This avoids the 
exploration of bi-directed walks which docs not start with >. In line 9, we stop 
our exploration only if u+ = t. These changes ensure that the bi-directed walk 
at s starts with > and ends with > at t. 

6 A Sufficient Condition for an Eulerian Tour on a 
Bi-directed Graph 

The following Lemma 1 [6] is a sufficient condition for a cyclic Eulerian tour in a 
bi-directed graph. A bi-directed graph which has a cyclic Eulerian tour is called 
an Eulerian bi-directed graph. 

Lemma 1. A connected bi-directed graph is Eulerian if and only if every vertex 
is balanced. 

Note that if a bi-dircctcd graph is Eulerian then a cyclic CP walk is the same 
as a cyclic Eulerian walk. We emphasize the cyclic adjective for the following 
reason. Figure 4(b) has a CP walk starting and ending at vertex 1. However 
the CP walk is not cyclic because the walk starts with > and ends with <l. The 
bi-directed graph in Figure 4(b) is not balanced. If the bi-directed graph is not 
Eulerian, the key strategy to find a cyclic CP walk is to make it Eulerian by 
introducing multi-edges into the original graph. The hope is that introducing 
multi-edges would make the bi-directed graph balanced. Thus a cyclic Eulerian 
walk on a balanced multi-edge bi-directed graph would give a cyclic CP walk 
on the original graph. Since wc arc interested in finding a shortest cyclic CP 
walk, we would like to minimize the number of multi-edges we introduce in the 
original graph. 
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(a) (b) 

Fig. 2. (a) a simple bi-directed graph, (b) a multi-bi-directed graph. Notice that ori- 
entations of the multi-edges is the same as the orientation of the original edge. 

7 A Deterministic Algorithm to Find a Cyclic CP Walk 
on a Bi-directed Graph 

We now describe our deterministic algorithm to find a cyclic CP walk on a 

weighted bi-directed graph. First we define a multi-bi-directed graph as a bi- 
directed graph in which an edge between two nodes is overlaid at least once, 
without changing its orientation. Figure 6(a) shows a bi-directed graph; Fig- 
ure 6(b) shows a valid multi-bi-directed graph. Notice that while overlaying 
the edge we did not change its orientation. Since the orientation of the multi- 
edges is same as the original edges, any bi-directed walk involving multi-edges 
is consistent with the bi-directed walk in the original graph. Another important 
property of the multi-bi-directed graphs is their ability to make the nodes bal- 
anced. Notice that the vertex 3 in the original bi-directed graph is positively 
imbalanced - diniv^) = 2,dout{v3) = 1. However in the multi-bi-directed graph 
in Figure 6(b) we are able to balance vertex 3 by introducing some multi-edges 
into the original graph. Given a bi-directed graph G = (V, E), let G"* = {V, £™) 
be some multi-bi-directed graph corresponding to G. The following Lemma 2 
gives a characterization for G to have a cyclic CP walk. 

Lemma 2. A non Eulerian bi-directed graph G = {V, E) has a cyclic Chinese 
Postman walk <^=^ 3 a corresponding multi-bi-directed graph G™ = {V,E"^) 
which is Eulerian. 

Proof. If G has a cyclic Chinese Postman walk, introduce a unique multi-edge 
in for every repeated edge in the cyclic Chinese Postman walk. This makes 
the cyclic Chinese Postman walk on the original graph a cyclic Eulcrain walk on 
the multi-bi-dirccted graph G™. Proving the other direction is very similar. □ 

Given a multi-bi-directed graph G™ {V, £"") corresponding to some bi-directed 
graph G = {V,E), we define the multi-bi-directed graph weight as >V(G™) = 
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c(e), where c : e £ E ^ M+ is a cost function on the bi-dircctcd graph 

G{V, E). We denote G* {V, E*) as the minimum weight Eulerian multi-bi-directed 
graph corresponding to GiV, E) if at all one exists. The following Lemmas are 
easy to prove. 

Lemma 3. Finding a cyclic CP walk on a bi-directed graph G{V, E) is equiv- 
alent to finding a minimum weight Eulerian multi-bi-directed graph G*{V,E*) 
corresponding to G. 

Lemma 4. // a bi-directed- graph G(V, E) has a cyclic CP walk then the cost of 
that walk is equal to the weight ofG*{V,E*). 

7.1 Balancing bi- partite graph 

Given a bi-directed de Bruijn graph G{V, E) we define a corresponding Balanc- 
ing Bi-partite Craph, B{P, Q, E'') as follows. Let = {v\ din{v) — dout{v) > 0}, 
V- = {v\dUv)^d,M < 0}. P = Upey+{pW,p(2)...p(M..(P)-rfo„*(p)l)}^ 
Q = Uqgy-{g(i\(?(2) . . . q(.\<iin{g)-dautiq)\)y We now introduce an edge between 
p^i) £ P and q'^j) G Q iff p, g S F are connected by a terminal oriented bi- 
directed walk from p to q. Let dist^ {p, q) be the weight of this walk. Then E^ = 
{{p'^^,q'^^^dist\p,q) ^ oo A p,q&V}. The weight of the edge {p^'\q'^^^) € E'' 
is the weight of terminal oriented bi-directed walk dist*{p, q). 

Lemma 5. A nan Eulerain bi-directed graph G(V, E) has a cyclic CP walk 
the balancing bi-partite graph B{P, Q, E^) has a perfect match. 

Proof. (Forward direction:) Since G has a cyclic CP walk, every un-balanced 
node V € ^(positive or negative) should appear at least i > \din{v) — dout{v)\ 
times. Label each occurrence of v in the cyclic CP walk by v^'^\ Also note that 
Y.pev+ \d-in{p) - doutip)\ = Y.qev- \din{q) - douL{q)\, sincc G has a cyclic CP 
walk. Now we can pair every i"' occurrence of a positively imbalanccd node p 
to some j*'^ occurrence of a negatively imbalanced node q since p(') and q^^^ are 
connected by a terminal oriented bi-directed walk in the cyclic CP walk. Every 
such pairing corresponds to a matched edge in B{P, Q, E^). 

(Reverse direction:) Consider the perfect match Mb in B{P,Q,E^). For every 
edge (pW,^^)) e Mb over-lay the underlying oriented bi-directed walk between 
p,q & V on G{V,E). This makes G{V,E) a balanced multi-bi-directed graph. 
Then by Lemma 2 we can construct a cyclic CP walk in G. □ 

7.2 Constructing a family of Eulerian multi-bi-directed graphs 

We now give a construction for generating Eulerian multi-bi-directed graphs 
corresponding to a given non Eulerian bi-directed graph which has a cyclic CP 
walk. We call this a Balancing Match Family denoted by J^. Lemma 5 can be used 
to generate Assume that G{V, E) is a non Eulerian bi-dircctcd graph that 
has a cyclic CP walk. The following construction generates a family of Eulerian 
multi-bi-directed graphs corresponding to G{V,E). 
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— STEP-1: Create a balancing bi-partite graph B{P,Q,E"'-) corresponding to 
G{V, E) by choosing some terminal oriented bi-directed walk between p^'^ e 
P and g(^) e Q. 

— STEP-2: Find a perfect match Mb in B{P,Q,E"'). For each edge in Mb 
overlay the corresponding terminal oriented bi-dircctcd walk on G{V,E). 
This generates a Eulerian multi-bi-directed graph G"^{V,E"^). 

The following Lemma 6 is easy to see. 

Lemma 6. // G{V, E) is a non Eulerian bi-directed graph that has a cyclic CP 
walk, then every corresponding Eulerian multi-bi-directed graph G"^{V, E"^) be- 
longs to the family T . 

The following Lemma gives an expression for the weight of any G'"(y, i?™) e 

T. 

Lemma 7. Let Giy, E, c) be a non Eulerian weighted bi-directed graph which 
has a cyclic CP walk c : E ^ R"^. Let G""(V, E"^, c) <E J- be som,e Eulerian multi- 
bi-directed graph. Then, W(G'") = ^c(e) + ^ dist*{p,q), where Mb 

eeE (p(*),gO))eM6 

is a perfect match in B{P, Q, E^). 

Proof. Since G™ is Eulerian it should cover every edge in G - this corresponds to 
the first term. Secondly, since G'"(V, E"^,c) S the cost of multi-edges coming 
from overlaying the terminal bi-directed walk corresponds to the match Mb in 
B{P, Q, E^). This corresponds to the second term. □ 

7.3 An algorithm for finding an optimal cyclic CP walk 

We now put together all the results in the preceding sub-section(s) to give an 
algorithm to find G* (V, E*). The algorithm is summarized in the following steps. 

— STEP-1: We first identify positive and negative imbalanced nodes in G. Let 

V+ = {v\dUv) - doutiv) > 0}, V- = {v\d,,a{v) - doutiv) < 0} 

— STEP-2: Find the cost of a terminal oriented shortest bi-directed walk be- 
tween every pair {v, u) € V"*" x V~ . Let this cost be denoted as dist^{v, u). 

— STEP-3: Create a balancing bi-partite graph B{P, Q, E^) as follows. Let P ~ 
U„ev+ W'\v^^\. . . , }, Q = u„^v_ {u(^),u, 

E = e P a w(^) e Q}. The cost of an edge c(wW,u(^)) = 

dist*{v, u). 

— STEP-4: Find a minimum cost perfect match in B. Let this match be Mb- If 
B docs not have a perfect match then G docs not have a cyclic CP walk. 

— STEP-5: For each edge (v^'^^u^^^) G Mb , overlay the terminal oriented short- 
est bi-directed walk between v and u in the G{V,E). After overlaying all 
the terminal oriented bi-directed walks from Mb on to G{V, E) we obtain 
G*(V, E*). We will prove that it is optimal in Theorem 2. 
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Theorem 2. IfG{V, E) is a bi-directed graph that has a cyclic CP walk, then the 
cost of this cyclic CP walk is equal to W{G*) = ^ c(e)+ ^ dist*{v,u). 

eeE {v('Ku(jy)eAh 

Here Mj, is the min-cost perfect match in the balancing bi-partite graph B. 

Proof. By Lemma 6 the multi-bi-directed graph G* (V, E* ) belongs to F. Now by 
Lemma 7, any optimal solution has to minimize the second term qii) dist*{p, q)). 

To minimize this the algorithm chooses shortest terminal oriented bi-directed 
walk in STEP-2. Finally, in STEP-5 the algorithm finds a minimum cost perfect 
match. Both these steps ensure that W(G'*) is minimum in the entire family of 
multi-bi-directed graphs in J^.D 

7.4 Runtime analysis of the algorithm to find a cyclic CP walk 

Let p = max{|F+|, \V^\} and d^ax = max{\din{v) — dout{v)}- STEP-2 of the 

algorithm runs in ©(pdV^I + log(|F|)) time to compute dist*{v, u). In STEP-3 
\P\ < dmaxP J IQI < djnaxP- For STEP-4 Hungarian method can be applied to 
solve the weighted matching problem in 0{{dmaxPY) time. So the total runtime 
of this deterministic algorithm is ©(pd^l + log(|\^|) + {dmaxP)^)- As men- 
tioned before if p is much smaller than \V\ this algorithm performs better than 
the bi-directed flow algorithm. 

7.5 Runtime analysis of the algorithm to find SDDNA 

Since SDDNA runs on a bi-directed de Bruijn graph which is un-wcightcd, STEP- 
2 of the algorithm runs in -|- \E\)) time - because we don't need to use 

a Heap, we just do a BFS on the bi-directed graph. The rest of the analysis 
for the rimtimc remains the same and the total run time of the algorithm is 

e{p{\v\ + \E\) + {d^^^pf). 

8 A @{p{\V\ + \E\)) Time Heuristic Algorithm for the 
SDDNA Problem 

From the analysis in the previous section, to solve the SDDNA problem deter- 
ministically we need to spend 0{p{\V\ + \E\)-\-{dmaxP)^) time. However if we just 
replace the Hungarian method in STEP-4 with a simple greedy algorithm we can 
get rid of the {dmaxPY term in the asymptotic complexity. Although we have a 
constant 2/3 — e approximation algorithm for maximum weighted matching, we 
are not aware of any constant approximation algorithms for minimum weight 
perfect matching. As a result this just remains as a heuristic algorithm. On the 
other hand this algorithm seems to be performing very close to the optimal (see 
Section 10.2). 
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9 Dealing with Practical Bi-directed de Bruijn Graphs 
with no Cyclic CP Walks 

As wc have mentioned earlier most of the bi-directed de Bruijn graphs con- 
structed from the reads do not satisfy the sufRcient condition for cycUc CP 
walks. In such cases our algorithm can still be used, by modifying it to find a 
maximum, match in the balancing bi-partite graph rather than perfect match. 
We can introduce a hypothetical node h and connect all the un-matched nodes 
in the balancing bi-partite graph to h with appropriate bi-directed edges and 
thus make all the original nodes balanced. We can now find a cyclic CP walk in 
this hypothetical graph. Every sub-walk in the cyclic CP walk that starts from 
h and ends at h can be reported as a contig. Thus our algorithm is capable of 
handling cases when the bi-directed graph cannot have a cyclic CP walk. 



10 Experimental Results 

As we have mentioned in the previous sections the asymptotic complexity of our 
algorithm depends on p - the maximum of positively and negatively imbalanced 
nodes. In the case of de Bruijn graphs dmax < where \S\ is the size the 
alphabet from which the strings are drawn. In our case this is exactly four. So 
we can safely ignore dmax in the case of de Bruijn graphs and just concentrate 
on p. In the rest of the discussion we would like to refer to p as the number of 
imbalanced nodes. 



10.1 Estimation of the mean of the random variable 

It is clear that p is a random variable with support in [0, So we would like 
to estimate the expected number of imbalanced nodes in a graph with |V"| bi- 
directed edges. We estimated the mean of the random variable from several 
samples of bi-directed de Bruijn graphs constructed from reads from a plant 
genome. A simple test is applied to to estimate the 95% confidence interval of 
1^1 . See Table 1 for the details of the samples used. Notice that as we increase the 
size of k (de Bruijn graph order) from 21 to 25, the number of imbalanced nodes 
in columns corresponding to \V~^\ and \V~\ reduces. This is because increasing 
k reduces the number of edges which may reduce the number of imbalanced 
nodes. On the other hand for a fixed value of k the number of imbalanced nodes 
increases consistently with the nodes. However the rate of growth is very slow 
compared to the rate of growth of the number of nodes. Finally we use this 
evidence to hypothesize that the number of imbalanced nodes in practical bi- 
directed graphs is only between 0.087% to 0.133% of the number of nodes in the 
graph, with a probability of 95%. 
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10.2 Performance of the greedy heuristic and handUng cases which 
do not have cycUc CP walks 

The greedy heuristic described in Section 8 has been compared with the opti- 
mal maxim,um match with minim,um cost. As we mentioned in Section 9 many 
of these graphs do not contain cycUc CP walks so they do not have a perfect 
match. To cope with this situation we treated the balancing bi-directed graph 
as a complete bi-dircctcd graph, by introducing a hypothetical edge with large 
cost whenever there is no edge between two nodes in the original graph. Thus 
we just used the size of the match to compare the cost of the greedy algorithm 
and the optimal algorithm to solve the matching problem. Table 2 gives the de- 
tails of the balancing bi-partite graph obtained from several read samples. As we 
have mentioned before, to get the approximation ratio we treated a balancing 
bi-partite graph as a complete bi-partite graph Kp^p. If Mgpi and Mgdy are the 
sizes of maximum and maximal matches then we treated the cost of hypothet- 
ical perfect match as [p — \Mopt\) and [p — \Mg4y\) and their ratio is used as 
approximation ratio. Finally from the evidence in Table 2 we hypothesize that 
the approximation ratio for this is between 1.008 and 1.016 with a probability 
of 95%. 

10.3 Implementation and Data 

An implementation of the algorithms discussed is available at http: //trinity, 
engr . uconn . edu/~vamsik/f ast_cpp . tgz. 

11 Conclusion and further research 

In this paper we have given an algorithm for cyclic Chinese Postman walk on a 
bi-directed de Bruijn graph. Our algorithm is based on identifying shortest bi- 
directed walks and weighted matching. This algorithm performs asymptotically 
better than the bi-dircctcd flow algorithm when the number of imbalanced nodes 
are much smaller than the nodes in the bi-directed graph. On the other hand 
this algorithm can also handle the instances of bi-directed graphs which does not 
have a cyclic CP walk and provide a minimal set of walks, cyclic walks which 
cover every edge in the bi-directed graph at least once. 

There are several research directions which can be pursued. Firstly, we need 
to address how the addition of paired reads may impose new constraints on the 
cyclic CPP walk. Secondly, while Eulerization of the bi-directed graph we have 
chosen the shortest path bi-directed path, however this may not correspond to 
the repeating region in the genome. Other strategy to make the graph Eulerian 
is to choose the path with maximum read multiplicity. This on other hand may 
increase the length of the Chinese walk, can we simultaneously optimize these 
two objectives ?. 

Acknowledgements. This work has been supported in part by the following 
grants: NSF 0326155, NSF 0829916 and NIH 1R01GM079689-01A1. 
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READS 


k 


NODES 


P-IMBAL 


N-IMBAL 


BAL-BI-GRAPH 










\V+\ 


\v-\ 




\Q\ 


P 


pxLUU 

\v\ 


102 100 


21 


l.")885()9 


1157 


ll;i;i 


1186 


1173 


1186 


0.075 


153600 


21 


2353171 


2240 


2141 


2298 


2211 


2298 


0.098 


204800 


21 


3097592 


3509 


3492 


3601 


3590 


3601 


0.116 


256000 


21 


3825101 


4953 


5004 


5074 


5131 


5131 


0.134 


307200 


21 


4538734 


6719 


6748 


6878 


6912 


6912 


0.152 


358400 


21 


5235821 


8586 


8603 


8789 


8802 


8802 


0.168 


409600 


21 


5917489 


10665 


10693 


10914 


10934 


10934 


0.185 


102400 


25 


1202962 


569 


521 


588 


540 


588 


0.049 


153600 


25 


1788533 


1104 


1026 


1139 


1062 


1139 


0.064 


204800 


25 


2362981 


1744 


1708 


1788 


1759 


1788 


0.076 


256000 


25 


2927656 


2521 


2523 


2579 


2592 


2592 


0.089 


307200 


25 


3484849 


3370 


3414 


3451 


3517 


3517 


0.101 


358400 


25 


4032490 


4333 


4369 


4441 


4485 


4485 


0.111 


409600 


25 


4571554 


5390 


5467 


5518 


5613 


5613 


0.123 



^ - , X + -g_a : 95% C.I for average is [0.0872%, 0.1330%] 



Table 1. The value of p on short read data from a plant genome sequencing data from 
CSHL 



READS 


k 


NODES 


max|P|,|g| 


OPT 


GRDY 


APX-RATIO 










SIZE 


COST 


SIZE 


COST 








1^1 


P 


\Mopt\ 


P- \Mopt\ 


\Mgdy\ 


P-\Mgdy\ 


OPT 


102400 


21 


1202962 


1186 


416 


770 


406 


780 


1.0130 


153600 


21 


1788533 


2298 


725 


1573 


704 


1594 


1.0134 


204800 


21 


2362981 


3601 


1092 


2509 


1073 


2528 


1.0076 


256000 


21 


3825101 


5131 


1479 


3652 


1450 


3681 


1.0079 


307200 


21 


4538734 


6912 


1929 


4983 


1879 


5033 


1.0100 


358400 


21 


5235821 


8802 


2385 


6417 


2329 


6473 


1.0087 


409600 


21 


5917489 


10934 


2876 


8058 


2814 


8120 


1.0077 


102400 


25 


1202962 


588 


1.52 


436 


147 


441 


1.0115 


153()()() 


25 


17885;i:i 


il:i9 


281 


858 


271 


865 


1.0082 


204800 


25 


2362981 


1788 


438 


1350 


429 


1359 


1.0067 


256000 


25 


2927656 


2592 


619 


1973 


601 


1991 


1.0091 


307200 


25 


3484849 


3517 


809 


2708 


783 


2734 


1.0096 


358400 


25 


4032490 


4485 


995 


3490 


966 


3519 


1.0083 


409600 


25 


4571554 


5613 


1220 


4393 


1175 


4438 


1.0102 


[x , X + a : 95% C.I for average qpt'cqst ^ [1-0083%, 1.0106%] 



Table 2. Approximation ratio of the GDY heuristic 
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Appendix 







iteration: 1 , min e 


= dist~ [2] = 1 










dist+[2] 


= oo 


dist+[3] = oo dist+[A] 


= oo disf^ [5] 


= oo 


disf^ 


[6] 


= oo 


dist~[2] 


= 1 


dist'[3] — oo dist~[4] 


— oo dist~ [5] 


= 00 


dist- 


[6] 


= oo 






iteration:2 , min e 


= dist- [5] = 2 










dist+[2] 


= oo 


dist+[3] = oo dist+[4] 


= oo dist''' [5] 


= oo 


disf^ 


[6] 


= oo 


dist~[2] 


= 1 


dist-[3] = 2 fi?:6-i"[4] 


= oc distr [5] 


= 2 


dist- 


[6] 


= oo 






iteration:3 , min e 


= dist" [3] = 2 










dist+[2] 


= oo 


dist+[3] = 3 dist+[A] 


= oo dist^ [5] 


= oo 


disf^ 


[6] 


= oo 


dist~[2] 


= 1 


dist-[3] = 2 distr[4] 


= oc dist~ [5] 


= 2 


dist- 


[6] 


= oo 






iteration:4 , min e 


= dist~ [6] = 3 










dist+[2] 


= oo 


dist+[3] = 3 dist+[A] 


= oo dist^ [5] 


= 3 


disf^ 


[6] 


= oo 


dist~[2] 


= 1 


dist~[3] = 2 distr[4] 


= oc d'istr [5] 


= 2 


dist- 


[6] 


= 3 






iteration:5 , min e 


= dist~ [5] = 3 










dist+[2] 


= oo 


dist+[3] = 3 dist+[4] 


= oo dist^ [5] 


= 3 


disf^ 


[6] 


= oo 


dist-[2] 


= 1 


dist-[3] = 2 disfrlA] 


= 4 dist- [5] 


= 2 


dist- 


[6] 


= 3 






iteration:6 , min e 


= dist~ [3] = 3 










dist+[2] 


= 4 


dist+[3] = 3 dist+[4] 


= oo disf^ [5] 


= 3 


disf^ 


[6] 


= oo 


dist~[2] 


= 1 


dist~[3] = 2 distr[4] 


= 4 dist~ [5] 


= 2 


dist- 


[6] 


= 3 






iteration:? , min e 


= dist- [4] = 4 










dist+[2] 


= 4 


dist+[3] = 3 dist+[4] 


= oo disf^ [5] 


= 3 


disf^ 


[6] 


= 00 


dist-[2] 


= 1 


dist-[3] = 2 dist-[4] 


= 4 dist- [5] 


= 2 


dist- 


[6] 


= 3 


The shortest path from node 


1 to node 4 is 


of length 4 







Table 3. Changes in the dist and dist labels in every iteration of Algorithm 1 on 
the bi-directed graph in Figure 4 
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